Management summary

This project builds a predictive model that estimates the number of sample cases of wine purchased by wine distribution companies after tasting a wine. These cases are used to provide tasting samples to restaurants and wine stores around the United States. The more sample cases purchased, the more likely the wine is to be sold at a high-end restaurant. After an initial variable inspection, data imputation, and transformation, three types of regression models were prepared and compared on test data: Poisson regression, negative binomial regression, and multiple linear regression. Based on regression performance metrics, the best model was selected and applied to the evaluation dataset.

1. DATA EXPLORATION

The training dataset contains 12795 observations of 16 variables (one index, one response, and 14 predictor variables).
Each record (row) represents a wine type being sold, described by a set of parameters such as its chemical properties. The response variable TARGET is a count: the number of cases of wine sold as tasting samples to restaurants and wine stores around the United States.

The variables are: INDEX (identifier), TARGET (response), FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, and STARS.

1.1. Univariate analysis

Summaries for the individual variables are provided below.

##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides       FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.1710   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.0310   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.0460   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.0548   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.1530   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.3510   Max.   : 623.00  
##                    NA's   :616        NA's   :638       NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359

From the summaries and the chart above we can see that all variables are numeric and that multiple variables have missing data, though the number of NAs is not very high with the exception of the STARS variable.

A check for near-zero variance did not flag any of the variables.
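The near-zero-variance check can be sketched in base R using the same heuristics as caret's nearZeroVar (frequency ratio of the two most common values and percentage of unique values); the thresholds below are caret's defaults, and applying it to the actual training data frame is left as a comment since the object name is an assumption.

```r
# Sketch of a near-zero-variance check, mirroring caret::nearZeroVar's
# default heuristics (freqCut = 95/5, uniqueCut = 10).
near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  tab <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's frequency to the second most common
  freq_ratio <- if (length(tab) > 1) tab[[1]] / tab[[2]] else Inf
  pct_unique <- 100 * length(unique(x)) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Applied to every predictor column (df assumed to be the training data):
# sapply(df[, -1], near_zero_var)
```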

Per-variable distribution analysis is provided below (excluding the INDEX variable, which is immaterial to the analysis and will not be considered further).

Summary of the findings from the univariate analysis:

INSERT FINDINGS HERE

1.2. Bivariate analysis

The pairwise correlations between the continuous variables are displayed below.

The pairwise scatterplots of the most highly correlated variables vs. the response are provided below.

Summary of the findings from the bivariate analysis:

  1. Most of the variables are uncorrelated with the response, and only a few show moderate correlations with TARGET, namely: STARS, AcidIndex, LabelAppeal, VolatileAcidity, and Alcohol.
  2. There is virtually no collinearity between the predictors; the only notable correlation, between STARS and LabelAppeal (r = +0.33), is not strong enough to cause concern.

2. DATA PREPROCESSING

2.1. Data cleaning

The exploratory analysis shows that the distributions of many predictors are suspiciously symmetrical around zero. Moreover, these predictor variables cannot physically be negative (e.g., a wine cannot have a negative alcohol or chloride content).
Therefore, we test whether the negative sign represents a data-entry error that can be removed without distorting the correlations in the dataset.

A correlation plot below shows the pairwise correlations in the data where the absolute values were taken for the columns: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Sulphates, Alcohol.

Comparing with the previous chart, we can see that the pairwise correlation coefficients have hardly changed and have not reversed direction, which supports the hypothesis that the minus sign is a data-entry error.
As an example, compare the scatterplots of Alcohol vs. TARGET for the raw and absolute values (a random sample of 5000 observations is displayed):

We can clearly see that removing the minus sign simply shifts the distribution of the Alcohol variable along the X axis and does not affect the relationship with the response variable.
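The check described above can be sketched as follows; cols_abs lists the affected columns named in the text, and the helper is written generically since the actual data frame name is an assumption.

```r
# Compare correlations with the response before and after taking absolute values.
# If the minus signs are data-entry noise, the correlations should barely change.
cols_abs <- c("FixedAcidity", "VolatileAcidity", "CitricAcid", "ResidualSugar",
              "Chlorides", "FreeSulfurDioxide", "TotalSulfurDioxide",
              "Sulphates", "Alcohol")

check_sign_error <- function(df, target = "TARGET", cols = cols_abs) {
  df_abs <- df
  df_abs[cols] <- lapply(df_abs[cols], abs)
  cor_raw <- sapply(cols, function(v) cor(df[[v]], df[[target]], use = "complete.obs"))
  cor_abs <- sapply(cols, function(v) cor(df_abs[[v]], df_abs[[target]], use = "complete.obs"))
  data.frame(variable = cols, cor_raw = cor_raw, cor_abs = cor_abs)
}
```

On simulated data where the sign is randomized noise, cor_abs recovers the underlying relationship while cor_raw is attenuated.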

2.2. Data imputation

As shown at the beginning of the exploratory analysis, several variables have missing data. The missing values are imputed using the functions of the mice package.

In the chart below, the density of the imputed data for each variable is shown in magenta, while the density of the observed data is shown in blue. The imputation does not produce very different distributions, so we can proceed with the imputed dataset.

## 
##  iter imp variable
##   1   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   2   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   3   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   4   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   5   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   6   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   7   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   8   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   9   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
##   10   1  pH  ResidualSugar  Chlorides  FreeSulfurDioxide  Alcohol  TotalSulfurDioxide  Sulphates  STARS
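The call pattern matching the log above is sketched in the comments below (the exact arguments are assumptions inferred from the 10-iteration log); as a runnable illustration of the far simpler fallback idea, not of the predictive mean matching that mice actually performs, missing values can also be filled with column medians in base R:

```r
# The report uses the mice package; the call pattern is roughly:
#   library(mice)
#   imp <- mice(df, m = 1, maxit = 10, method = "pmm", seed = 42)
#   df_imp <- complete(imp)
#   densityplot(imp)   # imputed (magenta) vs. observed (blue) densities
#
# A much simpler base-R stand-in (median imputation), for illustration only:
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}
```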

2.3. Data transformation: Box-Cox Method

From the density plots below we can also see that, after imputation, several predictors still do not show a normal distribution.

After iterating on power transformations, the following predictors were transformed using the Box-Cox method: FixedAcidity, VolatileAcidity, CitricAcid, TotalSulfurDioxide, Sulphates.
The \(\lambda\) values applied are shown in the table below.

Lambda coefficients for the Box-Cox transformation

Variable            Lambda
FixedAcidity        0.9616419
VolatileAcidity     1.110294
CitricAcid          0.9378151
TotalSulfurDioxide  0.9779128
Sulphates           1.002414

Plotting the distributions of the Box-Cox transformed variables, we can see that their distributions have become closer to normal.

This has resulted in improved correlation with the response for the transformed variables, as displayed in the plot below.
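A per-variable λ of this kind can be estimated with MASS::boxcox by fitting an intercept-only model and maximizing the profile log-likelihood over a grid; this is a sketch of one way to obtain such lambdas (the variable must be positive, which holds after the sign correction), not necessarily the exact procedure used in the report.

```r
library(MASS)

# Estimate the Box-Cox lambda for a single positive variable
estimate_lambda <- function(x, grid = seq(-2, 2, by = 0.01)) {
  bc <- boxcox(lm(x ~ 1), lambda = grid, plotit = FALSE)
  bc$x[which.max(bc$y)]   # lambda with the highest profile log-likelihood
}

# Apply the transformation with the estimated lambda
boxcox_transform <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
```

For a lognormal variable the estimated lambda is close to 0 (i.e., a log transform), while lambdas near 1, as in the table above, indicate the variable is already close to the target shape.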

3. BUILD MODELS

In this step, multiple regression models are built to predict the TARGET count of wine cases ordered.

The models are built on an 80% sample of the training data, and the remaining 20% is used to assess model performance on out-of-sample data in order to avoid choosing an overfitted model as the best one. The models' performance is compared using RMSE as the metric of prediction quality.

3.1. Build Poisson regression models

Poisson regression models count data. We build two models: the full model, and a model with the following set of predictors, which showed at least moderate correlation with the response: LabelAppeal, STARS, AcidIndex, VolatileAcidity, TotalSulfurDioxide, Alcohol. This subset covers the bottle appeal, the taste rating, and some of the main chemical properties of a wine.
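Both models can be fit with base R's glm and a log link; a minimal sketch, assuming df_train holds the training split, followed by a self-contained demonstration on simulated count data:

```r
# Full Poisson model: all predictors (df_train assumed)
# pois_full <- glm(TARGET ~ ., family = poisson, data = df_train)

# Reduced Poisson model: manually chosen predictors
# pois_red <- glm(TARGET ~ LabelAppeal + STARS + AcidIndex + VolatileAcidity +
#                   TotalSulfurDioxide + Alcohol,
#                 family = poisson, data = df_train)

# Self-contained demonstration that glm recovers Poisson coefficients:
set.seed(1)
x <- rnorm(2000)
y <- rpois(2000, lambda = exp(0.5 + 0.3 * x))  # true coefficients 0.5 and 0.3
fit <- glm(y ~ x, family = poisson)
coef(fit)  # estimates should be close to 0.5 and 0.3
```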

3.1.1. Poisson model 1: Full Model

The model in-sample performance is provided below.

Model summary and performance

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8628  -0.5105   0.2128   0.6343   2.6883  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.108e+00  2.271e-01   9.286  < 2e-16 ***
## Density            -3.406e-01  2.174e-01  -1.566 0.117308    
## LabelAppeal         2.148e-01  6.628e-03  32.413  < 2e-16 ***
## AcidIndex          -1.263e-01  4.993e-03 -25.301  < 2e-16 ***
## pH                 -1.815e-02  8.358e-03  -2.171 0.029921 *  
## ResidualSugar      -1.731e-05  2.279e-04  -0.076 0.939455    
## Chlorides          -3.259e-02  2.458e-02  -1.326 0.184898    
## FreeSulfurDioxide   1.206e-04  5.207e-05   2.316 0.020557 *  
## Alcohol             6.101e-03  1.566e-03   3.896 9.79e-05 ***
## STARS               1.691e-01  6.354e-03  26.618  < 2e-16 ***
## FixedAcidity       -1.998e-04  6.142e-03  -0.033 0.974050    
## VolatileAcidity    -2.767e-01  3.766e-02  -7.346 2.04e-13 ***
## CitricAcid          1.362e-01  2.360e-02   5.772 7.82e-09 ***
## TotalSulfurDioxide  2.759e-02  4.951e-03   5.573 2.51e-08 ***
## Sulphates          -1.017e-01  2.847e-02  -3.574 0.000352 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18338  on 10237  degrees of freedom
## Residual deviance: 14956  on 10223  degrees of freedom
## AIC: 40522
## 
## Number of Fisher Scoring iterations: 5

Model Explained.Variance RMSE
poisson_full 0.1843865 1.652511

From the model summary and the diagnostic plot we can see the following:
1) The errors are not quite normally distributed
2) Several variables are not significant

Interpretation of the regression coefficients for the significant variables

The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.

Variable Coefficient ConfLevel
(Intercept) 2.10831 1.00000
VolatileAcidity -0.27667 1.00000
LabelAppeal 0.21482 1.00000
STARS 0.16914 1.00000
CitricAcid 0.13621 1.00000
AcidIndex -0.12632 1.00000
Sulphates -0.10174 0.99965
TotalSulfurDioxide 0.02759 1.00000
pH -0.01815 0.97008
Alcohol 0.00610 0.99990
FreeSulfurDioxide 0.00012 0.97944

We can interpret the model coefficients as follows:
- VolatileAcidity, AcidIndex, Sulphates, and Chlorides have the strongest negative impact on the response variable. All of these variables describe taste parameters of a wine. The interpretation of their negative impact is that high concentrations of individual components are detrimental to the overall taste. - LabelAppeal, STARS, and CitricAcid have a positive impact on the number of ordered cases. This confirms the idea that bottle design and expert rating positively impact wine sales. Citric acid in small quantities can add citrus notes to wine taste, and is sometimes added to enhance the taste. This can be an explanation of a positive effect this variable has on the number of orders. - While statistically significant, the Alcohol content of a wine has no pratically relevant effect on the number of ordered cases.

3.1.2. Poisson model 2 (Manually reduced set of variables)

For the second model, only the following variables are considered:
LabelAppeal, STARS, AcidIndex, VolatileAcidity, TotalSulfurDioxide, Alcohol.

Model summary and performance

## 
## Call:
## glm(formula = TARGET ~ LabelAppeal + STARS + AcidIndex + VolatileAcidity + 
##     TotalSulfurDioxide + Alcohol, family = poisson, data = df_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7228  -0.5011   0.2192   0.6330   2.7294  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.731607   0.060232  28.749  < 2e-16 ***
## LabelAppeal         0.215639   0.006620  32.574  < 2e-16 ***
## STARS               0.169595   0.006351  26.705  < 2e-16 ***
## AcidIndex          -0.124833   0.004909 -25.427  < 2e-16 ***
## VolatileAcidity    -0.288878   0.037617  -7.679  1.6e-14 ***
## TotalSulfurDioxide  0.028403   0.004944   5.745  9.2e-09 ***
## Alcohol             0.006078   0.001567   3.879 0.000105 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18338  on 10237  degrees of freedom
## Residual deviance: 15017  on 10231  degrees of freedom
## AIC: 40567
## 
## Number of Fisher Scoring iterations: 5

Model Explained.Variance RMSE
poisson_reduced 0.1810526 1.657139

Looking at the model summary and performance on the in-sample data, we can see that all coefficients are now highly significant, and the explained variance is reduced only slightly (from 18.4% in the full model to 18.1% in the reduced model), while RMSE stayed virtually the same.

Interpretation of the regression coefficients for the significant variables

Variable Coefficient ConfLevel
(Intercept) 1.73161 1.0000
VolatileAcidity -0.28888 1.0000
LabelAppeal 0.21564 1.0000
STARS 0.16960 1.0000
AcidIndex -0.12483 1.0000
TotalSulfurDioxide 0.02840 1.0000
Alcohol 0.00608 0.9999

The interpretation of the coefficients, their direction and magnitude have stayed the same as in the full model.

3.2. Build negative binomial regression models

Negative binomial regression can be used for over-dispersed count data, that is when the conditional variance exceeds the conditional mean. It can be considered as a generalization of Poisson regression since it has the same mean structure as Poisson regression and it has an extra parameter to model the over-dispersion.[1]
Compared to the Poisson models built above, the main difference lies in the standard errors and confidence intervals of the regression coefficients.
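MASS::glm.nb fits this model; a self-contained sketch on simulated over-dispersed counts (the report's actual model formulas appear in the summaries that follow):

```r
library(MASS)

set.seed(7)
x <- rnorm(3000)
mu <- exp(0.4 + 0.25 * x)
# Negative binomial counts with dispersion theta = 2 (variance = mu + mu^2/theta)
y <- rnbinom(3000, size = 2, mu = mu)

nb_fit <- glm.nb(y ~ x)
coef(nb_fit)   # close to the true values 0.4 and 0.25
nb_fit$theta   # estimated dispersion parameter, close to 2
```

When the data are not over-dispersed, theta is driven toward very large values, which is exactly what the huge theta estimates in the summaries below indicate: the data are essentially Poisson.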

3.2.1. Negative binomial model 1 (Manually reduced)

The first negative binomial model is a reduced version of the full model in which the variables not found significant in the full Poisson model are excluded: Density, ResidualSugar, FreeSulfurDioxide, and FixedAcidity.

The model in-sample performance is provided below.

Model summary and performance

## 
## Call:
## glm.nb(formula = TARGET ~ . - Density - ResidualSugar - FreeSulfurDioxide - 
##     FixedAcidity, data = df_train, init.theta = 35493.57398, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8956  -0.5134   0.2156   0.6359   2.7251  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.786339   0.073538  24.292  < 2e-16 ***
## LabelAppeal         0.215126   0.006625  32.471  < 2e-16 ***
## AcidIndex          -0.126740   0.004934 -25.688  < 2e-16 ***
## pH                 -0.018327   0.008359  -2.192 0.028353 *  
## Chlorides          -0.032543   0.024582  -1.324 0.185563    
## Alcohol             0.006091   0.001566   3.889 0.000100 ***
## STARS               0.169015   0.006352  26.608  < 2e-16 ***
## VolatileAcidity    -0.275714   0.037651  -7.323 2.43e-13 ***
## CitricAcid          0.136350   0.023593   5.779 7.50e-09 ***
## TotalSulfurDioxide  0.027616   0.004950   5.579 2.41e-08 ***
## Sulphates          -0.103293   0.028462  -3.629 0.000284 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(35493.57) family taken to be 1)
## 
##     Null deviance: 18336  on 10237  degrees of freedom
## Residual deviance: 14963  on 10227  degrees of freedom
## AIC: 40524
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  35494 
##           Std. Err.:  66351 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -40499.74

Model Explained.Variance RMSE
nb_1 0.1839566 1.653338

As expected, nearly all coefficients provided by the model are statistically significant (Chlorides is the exception) and are close to the ones estimated by the full Poisson model. Model performance is also close to that of the full Poisson model.

Interpretation of the regression coefficients for the significant variables

The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.

Variable Coefficient ConfLevel
(Intercept) 1.78634 1.00000
VolatileAcidity -0.27571 1.00000
LabelAppeal 0.21513 1.00000
STARS 0.16901 1.00000
CitricAcid 0.13635 1.00000
AcidIndex -0.12674 1.00000
Sulphates -0.10329 0.99972
TotalSulfurDioxide 0.02762 1.00000
pH -0.01833 0.97165
Alcohol 0.00609 0.99990

We can see that the coefficients for the variables shared between this reduced negative binomial model and the full Poisson model are only slightly different, which is caused by the absence of several predictors in this model vs. the full model.
The interpretation of the coefficients, their direction and magnitude have stayed the same as in the full model.

3.2.2. Negative binomial model 2 (Only two predictors)

The second negative binomial model considers only two predictors that are measurable without a chemical analysis of the wine: STARS and LabelAppeal. The goal of building this model is to test it against the first negative binomial model to assess whether the chemical composition predictors play an important role.

The model in-sample performance is provided below.

Model summary and performance

## 
## Call:
## glm.nb(formula = TARGET ~ STARS + LabelAppeal, data = df_train, 
##     init.theta = 22381.46543, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5653  -0.4963   0.2734   0.6965   2.4124  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.691926   0.014937   46.32   <2e-16 ***
## STARS       0.184826   0.006291   29.38   <2e-16 ***
## LabelAppeal 0.209143   0.006605   31.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(22381.47) family taken to be 1)
## 
##     Null deviance: 18336  on 10237  degrees of freedom
## Residual deviance: 15867  on 10235  degrees of freedom
## AIC: 41412
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  22381 
##           Std. Err.:  66815 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -41403.93

Model Explained.Variance RMSE
nb_2 0.1346434 1.718995

We can see a reduction in the explained variance in this model vs. the first one (from 18.4% to 13.5%), and a growth in the RMSE. In order to confirm the significance of the chemical variables to the model, we compare the two models using a Likelihood Ratio Test (as the second negative binomial model is nested in the first one) [1,2].

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: TARGET
##                                                                                                                                                                                                                                                    Model
## 1                                                                                                                                                                                                                                    STARS + LabelAppeal
## 2 (Density + LabelAppeal + AcidIndex + pH + ResidualSugar + Chlorides + FreeSulfurDioxide + Alcohol + STARS + FixedAcidity + VolatileAcidity + CitricAcid + TotalSulfurDioxide + Sulphates) - Density - ResidualSugar - FreeSulfurDioxide - FixedAcidity
##      theta Resid. df    2 x log-lik.   Test    df LR stat. Pr(Chi)
## 1 22381.47     10235       -41403.93                              
## 2 35493.57     10227       -40499.74 1 vs 2     8 904.1972       0

From the output we can see that the Likelihood Ratio statistic (904.20 on 8 degrees of freedom) is very significantly different from zero, with a p-value of effectively zero. This means that the chemical variables carry relevant information for the prediction of the response.
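The likelihood-ratio statistic reported above can be computed generically for any pair of nested models as twice the log-likelihood difference, referred to a chi-squared distribution; a sketch, demonstrated on simulated nested Poisson fits rather than the report's models:

```r
# Generic likelihood-ratio test for nested models (reduced nested in full)
lr_test <- function(fit_reduced, fit_full) {
  lr_stat <- as.numeric(2 * (logLik(fit_full) - logLik(fit_reduced)))
  df_diff <- attr(logLik(fit_full), "df") - attr(logLik(fit_reduced), "df")
  p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
  c(statistic = lr_stat, df = df_diff, p.value = p_value)
}

# Demonstration with nested Poisson fits on simulated data
set.seed(3)
x1 <- rnorm(1000); x2 <- rnorm(1000)
y <- rpois(1000, exp(0.2 + 0.5 * x1 + 0.3 * x2))
small <- glm(y ~ x1, family = poisson)
large <- glm(y ~ x1 + x2, family = poisson)
lr_test(small, large)  # x2 matters, so the p-value is effectively zero
```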

Interpretation of the regression coefficients for the significant variables

The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.

Variable Coefficient ConfLevel
(Intercept) 0.69193 1
LabelAppeal 0.20914 1
STARS 0.18483 1

Interestingly, the coefficients for the STARS and LabelAppeal variables are only slightly different from those in the first negative binomial model.
The interpretation of the coefficients, their direction and magnitude have stayed the same as in the full Poisson model.

3.3. Build multiple linear regression models

In order to test whether the Poisson model (log link function) is really the best choice for modeling the relationship between the predictors and the response, we test two linear models: a full model, and a model selected automatically using a backward stepwise approach.

3.3.1. Multiple linear model 1 (full model)

The model in-sample performance is provided below.

Model summary and performance

## 
## Call:
## lm(formula = TARGET ~ ., data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9539 -0.7296  0.3857  1.1241  4.9667 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.6181135  0.6466711   8.688  < 2e-16 ***
## Density            -0.9105715  0.6203816  -1.468 0.142200    
## LabelAppeal         0.6531062  0.0189284  34.504  < 2e-16 ***
## AcidIndex          -0.3359182  0.0127028 -26.444  < 2e-16 ***
## pH                 -0.0463445  0.0239363  -1.936 0.052876 .  
## ResidualSugar      -0.0002924  0.0006533  -0.448 0.654434    
## Chlorides          -0.1064593  0.0700150  -1.521 0.128411    
## FreeSulfurDioxide   0.0003354  0.0001502   2.234 0.025536 *  
## Alcohol             0.0202356  0.0044948   4.502 6.81e-06 ***
## STARS               0.5454545  0.0188179  28.986  < 2e-16 ***
## FixedAcidity       -0.0000681  0.0175237  -0.004 0.996899    
## VolatileAcidity    -0.8091070  0.1083263  -7.469 8.73e-14 ***
## CitricAcid          0.4151552  0.0677770   6.125 9.38e-10 ***
## TotalSulfurDioxide  0.0796650  0.0140483   5.671 1.46e-08 ***
## Sulphates          -0.2967962  0.0819140  -3.623 0.000292 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.65 on 10223 degrees of freedom
## Multiple R-squared:  0.2682, Adjusted R-squared:  0.2672 
## F-statistic: 267.6 on 14 and 10223 DF,  p-value: < 2.2e-16

From the model summary we can see that almost the same variables are not significant for the linear model as for the Poisson model.

Model Adj..R.Squared RMSE
lm_full 0.2671715 1.648819

While the Adjusted R-squared metric (26.7%) appears higher than the explained deviance of the Poisson/negative binomial models, the two measures are not directly comparable; the in-sample RMSE of the linear model (1.649) is in fact marginally lower than that of the full Poisson model (1.653).

Interpretation of the regression coefficients for the significant variables

The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.

Variable Coefficient ConfLevel
(Intercept) 5.61811 1.00000
VolatileAcidity -0.80911 1.00000
LabelAppeal 0.65311 1.00000
STARS 0.54545 1.00000
CitricAcid 0.41516 1.00000
AcidIndex -0.33592 1.00000
Sulphates -0.29680 0.99971
TotalSulfurDioxide 0.07966 1.00000
Alcohol 0.02024 0.99999
FreeSulfurDioxide 0.00034 0.97446

The coefficients provided by the linear model differ in scale, as they relate the predictors directly to the values of TARGET rather than via an exponent, as in the Poisson model. However, the order of importance, the direction, and the relative magnitudes of the coefficients lead to the same conclusions as in the full Poisson model.

3.3.2. Multiple linear model 2 (Stepwise selection)

The second linear model is built using an automated stepwise model selection approach in which all non-significant predictors are eliminated. The eliminated predictors are shown in the table below.

Variable Relevant.Predictor
Density FALSE
ResidualSugar FALSE
Chlorides TRUE
FixedAcidity FALSE
(Intercept) TRUE
LabelAppeal TRUE
AcidIndex TRUE
pH TRUE
FreeSulfurDioxide FALSE
Alcohol TRUE
STARS TRUE
VolatileAcidity TRUE
CitricAcid TRUE
TotalSulfurDioxide TRUE
Sulphates TRUE

The resulting linear model excludes the following predictors: Density, ResidualSugar, FreeSulfurDioxide, and FixedAcidity.
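Backward stepwise selection of this kind is typically done with base R's step(), which drops terms by AIC; a sketch (lm_full is an assumed name for the full linear model), followed by a self-contained demonstration:

```r
# Backward elimination by AIC from the full linear model (lm_full assumed)
# lm_step <- step(lm_full, direction = "backward", trace = 0)

# Self-contained demonstration: a noise predictor tends to be dropped
set.seed(5)
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- 2 * d$x1 + rnorm(500)          # x2 is pure noise
full <- lm(y ~ x1 + x2, data = d)
reduced <- step(full, direction = "backward", trace = 0)
names(coef(reduced))  # x1 is retained; x2 is likely eliminated
```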

The model in-sample performance is provided below.

Model summary and performance

## 
## Call:
## lm(formula = TARGET ~ . - Density - ResidualSugar - FreeSulfurDioxide - 
##     FixedAcidity, data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9982 -0.7257  0.3845  1.1253  4.9280 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.755705   0.204763  23.225  < 2e-16 ***
## LabelAppeal         0.654241   0.018926  34.568  < 2e-16 ***
## AcidIndex          -0.337231   0.012504 -26.969  < 2e-16 ***
## pH                 -0.046994   0.023939  -1.963 0.049664 *  
## Chlorides          -0.106300   0.070020  -1.518 0.129012    
## Alcohol             0.020221   0.004495   4.498 6.92e-06 ***
## STARS               0.544964   0.018813  28.967  < 2e-16 ***
## VolatileAcidity    -0.806482   0.108334  -7.444 1.05e-13 ***
## CitricAcid          0.417240   0.067782   6.156 7.76e-10 ***
## TotalSulfurDioxide  0.079655   0.014046   5.671 1.46e-08 ***
## Sulphates          -0.301990   0.081899  -3.687 0.000228 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.65 on 10227 degrees of freedom
## Multiple R-squared:  0.2676, Adjusted R-squared:  0.2669 
## F-statistic: 373.8 on 10 and 10227 DF,  p-value: < 2.2e-16

From the model summary we can see that all retained predictors except Chlorides are statistically significant (pH only marginally so).

Model Adj..R.Squared RMSE
lm_stepwise 0.2669304 1.649413

The Adjusted R-squared and RMSE metrics have hardly moved vs. the full model, indicating that the excluded predictors did not add any explanatory power to the model.

Interpretation of the regression coefficients for the significant variables

The model coefficients (only statistically significant ones) ranked by descending magnitude are provided in the table below.

Variable Coefficient ConfLevel
(Intercept) 4.75570 1.00000
VolatileAcidity -0.80648 1.00000
LabelAppeal 0.65424 1.00000
STARS 0.54496 1.00000
CitricAcid 0.41724 1.00000
AcidIndex -0.33723 1.00000
Sulphates -0.30199 0.99977
TotalSulfurDioxide 0.07966 1.00000
pH -0.04699 0.95034
Alcohol 0.02022 0.99999

The direction and interpretation of the coefficients are the same as in the full linear model.

4. MODEL SELECTION

In this step, the performance of all six models is compared based on RMSE on the out-of-sample data. The best performing model is chosen as the final one.

model RMSE
Poisson full 1.660830
Poisson reduced 1.663281
Neg. binom. 1 (large) 1.661722
Neg. binom. 2 (2 variables) 1.730915
Linear full 1.665180
Linear stepwise reduced 1.666590

We can see that on out-of-sample data, the linear model with the automatically selected set of predictors has performed on par with the full linear and full Poisson models.

Comparing the fitted values against the test data, we can see that the linear model's errors are fairly constant at each level of the response. However, it fails to capture the highest levels (counts of 7 and 8 cases). The Poisson model has a visibly higher error at the zero count of the response, but does provide several predictions for the highest levels of demand.

Due to its simplicity in terms of the number of predictors and ease of interpretation, the reduced (stepwise) linear model is chosen as the best one for generating predictions on the evaluation dataset.

However, further tuning of the negative-binomial family of models (e.g., using a zero-inflated model) could improve the precision of the predictions, especially at the lower end of the distribution of the target variable.

Predictions on the evaluation dataset

The evaluation dataset is transformed in the same way as the training dataset in order to provide correct predictions. NA values in the evaluation data will cause missing predictions.

Predictions on the evaluation dataset are made using the model lm_stepwise.
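Prediction then reduces to applying the same preprocessing to the evaluation data and calling predict(); a sketch with assumed object names (lm_stepwise, df_eval), demonstrated on a built-in dataset. The rounding to whole cases is an assumption for illustration, not a step stated in the report.

```r
# Apply the fitted model to the (identically preprocessed) evaluation data.
# Rows with remaining NAs yield NA predictions.
# pred <- predict(lm_stepwise, newdata = df_eval)
# Optionally round to whole cases (an assumption, not stated in the report):
# pred_cases <- pmax(0, round(pred))

# Self-contained demonstration of the mechanics on the built-in cars data:
fit <- lm(dist ~ speed, data = cars)
new_data <- data.frame(speed = c(10, 20, NA))
preds <- predict(fit, newdata = new_data)
is.na(preds[3])  # the NA input propagates to an NA prediction
```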

The output of the model on the evaluated data is available under the following URL:

Appendix

The full R code for the analysis in Rmd format is available under the following URL:

Reference

  1. Negative Binomial Regression | R Data Analysis Examples. UCLA: Statistical Consulting Group. https://stats.idre.ucla.edu/r/dae/negative-binomial-regression/ (Accessed on May 12, 2018).
  2. Faraway, J. J. (2006). Extending the linear model with R: Generalized linear, mixed effects and nonparametric regression models.